feat(rust/sedona-spatial-join-gpu): Add GPU-accelerated spatial join support #465
base: main
Conversation
Will merge the patch to fix the failing example build once this is merged - #486

paleolimbot left a comment
Thank you for working on this!
In addition to specific comments, I'm concerned about the proliferation of conditional compilation (particularly within sedona-spatial-join, which is a rather important part of our engine to keep clean).
At a high level what sedona-spatial-join-gpu is doing is more like sedona-spatial-join-extension: it provides a simpler (FFI-friendly) mechanism to inject a join operator without dealing with the DataFusion-y details. I think most of the conditional compilation/dead code/unused/ignore directives could be avoided if we add a CPU join extension and use that for all the tests. The GPU extension itself would then be a runtime implementation detail (eventually loaded at runtime via FFI).
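To make the suggestion concrete, here is a minimal sketch of what such an extension hook could look like (all names here are illustrative, not an existing crate API): the engine asks a registered extension for a join plan and falls back to the built-in CPU join when none applies.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical extension point: a GPU (or CPU test) implementation can be
/// registered at runtime instead of being compiled into sedona-spatial-join.
pub trait SpatialJoinExtension: Send + Sync {
    /// Return Some(plan) if this extension wants to handle the join,
    /// or None to fall back to the default CPU implementation.
    fn try_create_join(
        &self,
        left: Arc<dyn ExecutionPlan>,
        right: Arc<dyn ExecutionPlan>,
    ) -> Result<Option<Arc<dyn ExecutionPlan>>>;
}
```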
// Helper execution plan that returns a single pre-loaded batch
struct SingleBatchExec {
    schema: Arc<Schema>,
    batch: RecordBatch,
    props: datafusion::physical_plan::PlanProperties,
}
This seems very similar to SessionContext::register_batch() and is a lot of lines of code. Do we need this?
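For reference, a minimal sketch of the suggested alternative, assuming the test only needs the batch to be queryable (the table name and query are illustrative):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

async fn query_single_batch(batch: RecordBatch) -> Result<()> {
    let ctx = SessionContext::new();
    // Registers the pre-loaded batch as a table, replacing the custom exec plan.
    let _ = ctx.register_batch("probe_side", batch)?;
    let df = ctx.sql("SELECT * FROM probe_side").await?;
    df.show().await?;
    Ok(())
}
```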
This file has been removed
/// Generate random points within a bounding box
fn generate_random_points(count: usize) -> Vec<String> {
    use rand::Rng;
    let mut rng = rand::thread_rng();
    (0..count)
        .map(|_| {
            let x: f64 = rng.gen_range(-180.0..180.0);
            let y: f64 = rng.gen_range(-90.0..90.0);
            format!("POINT ({} {})", x, y)
        })
        .collect()
}
We have a random geometry generator in sedona-testing (that is used in the non-GPU join tests and elsewhere) that I think we should be using here!
@zhangfengcdt Hi Feng, do we still need this benchmark program? It compares to a brute-force CPU implementation.
        sedona_libgpuspatial::SpatialPredicate::Intersects,
    ),
    device_id: 0,
    batch_size: 8192,
Should this be Option<usize> so that it can default to the datafusion.batch_size setting?
The batch_size option does not exist anymore. To fully exploit the parallelism of the GPU, users need to manually set datafusion.execution.batch_size to a very large value. I want to override this value when the GPU feature is enabled, but I haven't figured out the right place to insert that logic.
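For context, a sketch of how a caller can raise the batch size today, either programmatically or via SQL; the 65536 value is only illustrative:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn gpu_friendly_context() -> SessionContext {
    // Larger batches give the GPU join more work per kernel launch.
    let config = SessionConfig::new().with_batch_size(65_536);
    SessionContext::new_with_config(config)
}

// Equivalent at the SQL level:
//   SET datafusion.execution.batch_size = 65536;
```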
let properties = PlanProperties::new(
    eq_props,
    partitioning,
    EmissionType::Final, // GPU join produces all results at once
Just checking that this is correct (I thought that because one side is streaming the output might be incremental?)
This file has been completely rewritten.
/// GPU backend for spatial operations
#[allow(dead_code)]
pub struct GpuBackend {
    device_id: i32,
    gpu_context: Option<GpuSpatialContext>,
}

#[allow(dead_code)]
impl GpuBackend {
Can these dead code markers be removed?
This file has been removed.
let kernels = scalar_kernels();
let sedona_type = SedonaType::Wkb(Edges::Planar, lnglat());

let _cpu_testers: std::collections::HashMap<&str, ScalarUdfTester> = [
Is there a reason this variable is not used / can we do this using a for loop to avoid this indirection?
This test was incomplete, but it has now been completed.
rust/sedona-spatial-join/src/exec.rs (Outdated)

#[cfg(feature = "gpu")]
#[tokio::test]
#[ignore] // Requires GPU hardware
We need to figure out a way to not ignore tests in this repo (in this case I think these tests shouldn't exist if the gpu feature isn't enabled, so we shouldn't need to ignore them?)
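A minimal sketch of that arrangement (module and test names are placeholders): gate the whole test module on the feature so no #[ignore] is needed.

```rust
// Only compiled when the crate is built with `--features gpu`.
#[cfg(feature = "gpu")]
mod gpu_join_tests {
    #[tokio::test]
    async fn gpu_spatial_join_smoke_test() {
        // ...exercises the GPU join; never built (or listed) without the feature
    }
}
```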
Fixed.
SpatialRelationType::Intersects => LibGpuPred::Intersects,
SpatialRelationType::Contains => LibGpuPred::Contains,
SpatialRelationType::Covers => LibGpuPred::Covers,
SpatialRelationType::Within => LibGpuPred::Within,
SpatialRelationType::CoveredBy => LibGpuPred::CoveredBy,
SpatialRelationType::Touches => LibGpuPred::Touches,
SpatialRelationType::Equals => LibGpuPred::Equals,
Can we move SpatialRelationType to sedona-geometry or sedona-common to avoid two copies?
I have extracted SpatialRelationType into a separate file under sedona-geometry. It involves some changes in the sedona-spatial-join package; I'd like to ask @Kontinuation whether this change is appropriate.
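A sketch of what the extracted type and its GPU mapping might look like (the module path and function name are assumptions; the variants are the ones shown in this diff):

```rust
use sedona_geometry::types::SpatialRelationType; // assumed path after extraction
use sedona_libgpuspatial::SpatialPredicate as LibGpuPred;

/// Single home for the mapping, returning None for any predicate the GPU
/// backend does not support (assumption).
fn to_gpu_predicate(rel: SpatialRelationType) -> Option<LibGpuPred> {
    Some(match rel {
        SpatialRelationType::Intersects => LibGpuPred::Intersects,
        SpatialRelationType::Contains => LibGpuPred::Contains,
        SpatialRelationType::Covers => LibGpuPred::Covers,
        SpatialRelationType::Within => LibGpuPred::Within,
        SpatialRelationType::CoveredBy => LibGpuPred::CoveredBy,
        SpatialRelationType::Touches => LibGpuPred::Touches,
        SpatialRelationType::Equals => LibGpuPred::Equals,
        _ => return None,
    })
}
```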
We have a unified spatial-join package. This piece of code has been removed
git submodule update --recursive should remove this diff
Fixed
python/sedonadb/Cargo.toml (Outdated)

default = ["mimalloc"]
mimalloc = ["dep:mimalloc", "dep:libmimalloc-sys"]
s2geography = ["sedona/s2geography"]
gpu = ["sedona/gpu"]
Because we don't have any tests in Python for this feature I suggest leaving this out for now (a follow-up PR could add Python support + a test)
I have removed this feature.
af9f3b6 to 470eb03

Hi @zhangfengcdt @paleolimbot, the GPU-based spatial join has been completely rewritten. The initial version wrapped both the filtering and refinement stages into the same interface (C library). However, that approach suffered from build-side parsing overhead because it could not be shared across partitions. I have redesigned it to match the CPU-based join, where the two stages are separated. This version runs faster and takes less memory. It also supports WHERE clauses (which the initial version did not). The implementation is based on @Kontinuation's spatial-join, so please give me some advice if you can. I realize this PR is large, but the structural changes were necessary. Please review it when you have time. Thanks!
The modifications made to the existing CPU spatial join code are pretty trivial. If I understand it correctly, it has only moved [...]. I still need to take some time looking into sedona-spatial-join-gpu. It is an enormous amount of code.
tokio = { workspace = true }
mimalloc = { workspace = true, optional = true }
libmimalloc-sys = { workspace = true, optional = true }
env_logger = { workspace = true }
I'm in favor of it; I had it when testing my local out-of-core spatial join branch.
The GPU spatial join code is mostly the same as the CPU code. The difference is mainly in the spatial index.
Notably, the GPU index performs filtering and refinement in batches, while the current spatial index in the CPU spatial join works with one probe geometry at a time. I have already refactored the CPU spatial index to work with batches for better performance: https://github.com/Kontinuation/sedona-db/blob/485d45fb95fd278b47253bfc2cb1f3fd93798075/rust/sedona-spatial-join/src/index/spatial_index.rs#L437-L460. This will be submitted upstream soon, and I believe it is quite feasible to unify the code for CPU- and GPU-based spatial join.
I am fine with duplicating code for GPU spatial join, we can do some refactoring to unify them later on.
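To illustrate the difference (these signatures are illustrative only, not the actual traits): the per-geometry probe versus a batched probe that both the refactored CPU index and the GPU index could share.

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;

/// Current CPU-style interface: one probe geometry (as WKB) at a time.
pub trait ProbeOneAtATime {
    fn probe(&self, probe_geom_wkb: &[u8]) -> Result<Vec<u32>>; // build-side matches
}

/// Batched interface: filter and refine a whole probe batch in one call,
/// returning matching (build_index, probe_index) pairs.
pub trait ProbeBatched {
    fn probe_batch(&self, probe: &RecordBatch) -> Result<(Vec<u32>, Vec<u32>)>;
}
```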
c/sedona-libgpuspatial/src/lib.rs (Outdated)

index.create_context(&mut ctx);

// Get results
let build_indices = joiner.get_build_indices_buffer(context).to_vec();
let stream_indices = joiner.get_stream_indices_buffer(context).to_vec();
// Push stream data (probe side) and perform join
unsafe {
    index.probe(&mut ctx, rects.as_ptr() as *const f32, rects.len() as u32)?;
}

Ok((build_indices, stream_indices))
// Get results
let build_indices = index.get_build_indices_buffer(&mut ctx).to_vec();
let probe_indices = index.get_probe_indices_buffer(&mut ctx).to_vec();
index.destroy_context(&mut ctx);
Ok((build_indices, probe_indices))
ctx will leak when index.probe fails.
Fixed.
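For reference, one possible shape of the fix, based on the calls shown in this diff: destroy the context before propagating a probe error instead of returning early with `?`. A small drop guard around the context would achieve the same thing with less repetition.

```rust
// Run the probe without `?` so the context can be cleaned up on failure.
let probe_result = unsafe {
    index.probe(&mut ctx, rects.as_ptr() as *const f32, rects.len() as u32)
};
if let Err(e) = probe_result {
    index.destroy_context(&mut ctx); // avoid leaking ctx when probe fails
    return Err(e);
}

let build_indices = index.get_build_indices_buffer(&mut ctx).to_vec();
let probe_indices = index.get_probe_indices_buffer(&mut ctx).to_vec();
index.destroy_context(&mut ctx);
Ok((build_indices, probe_indices))
```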
self.build_batch.batch = arrow::compute::concat_batches(&schema, all_record_batches)
    .map_err(|e| {
        DataFusionError::Execution(format!("Failed to concatenate left batches: {}", e))
    })?;
I wonder if it is necessary to concatenate all the batches and call gs.push_build only once. It would be easier for us to unify the CPU- and GPU-based spatial join if we support calling push_build multiple times.
There might be some performance implications switching to multiple gs.push_build calls. If it is not feasible, we need to be aware that concatenating batches doubles the memory requirement, so we need to reserve more memory for building GPU indexes.
It is feasible to call push_build many times, but it may slightly increase the build time. I think it is worth making this change to unify the code.
I have added an option that enables/disables the concatenation operations.
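A sketch of the two paths behind that option (the builder trait, option name, and push_build signature below are assumptions based on this discussion, not the PR's actual types):

```rust
use arrow::compute::concat_batches;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Minimal stand-in for the GPU index builder discussed above.
trait PushBuild {
    fn push_build(&mut self, batch: &RecordBatch);
}

fn load_build_side<B: PushBuild>(
    gs: &mut B,
    schema: &SchemaRef,
    batches: &[RecordBatch],
    concat_build_side: bool, // the new option (name assumed)
) -> Result<(), ArrowError> {
    if concat_build_side {
        // One push_build call on a single concatenated batch; concatenation
        // roughly doubles peak memory on the build side.
        gs.push_build(&concat_batches(schema, batches)?);
    } else {
        // Stream each batch into the index: slightly slower build, no extra copy.
        for batch in batches {
            gs.push_build(batch);
        }
    }
    Ok(())
}
```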
01d495b to 58919b0

Hi @Kontinuation, I have merged the GPU code path into sedona-spatial-join, so there's no duplicated code issue. Could you review the changes to the spatial-join package?

Kontinuation left a comment
The overall approach looks good to me.
pub(crate) trait SpatialIndexFull: SpatialIndex + SpatialIndexInternal {}

let index = builder.finish().unwrap();
impl<T> SpatialIndexFull for T where T: SpatialIndex + SpatialIndexInternal {}
I prefer defining the SpatialIndex trait as one rather than splitting it into external and internal interfaces. At least for now SpatialIndex is only intended to be used by spatial join so we don't need to define external interfaces for it.
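A sketch of the unified shape (method names are illustrative): one crate-private trait in place of SpatialIndex + SpatialIndexInternal plus the blanket SpatialIndexFull impl.

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;

/// Single crate-private trait; no external/internal split is needed while
/// the index is only consumed by the spatial join itself.
pub(crate) trait SpatialIndex {
    fn push_build(&mut self, batch: &RecordBatch) -> Result<()>;
    fn probe_batch(&mut self, batch: &RecordBatch) -> Result<(Vec<u32>, Vec<u32>)>;
}
```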
I have fixed this. Now, it only has a unified interface.
rust/sedona-spatial-join/src/exec.rs (Outdated)

let use_gpu = cfg!(feature = "gpu");

let mark_exec = SpatialJoinExec::try_new(
    original_exec.left.clone(),
    original_exec.right.clone(),
    original_exec.on.clone(),
    original_exec.filter.clone(),
    &join_type,
    None,
    use_gpu,
Do we still run CPU spatial join tests when GPU feature is enabled?
I initially thought we would run tests twice with and without --features gpu. This can eliminate duplicated code. However, I think it is more natural to run both CPU and GPU tests when using --features gpu. I have updated the tests accordingly.
}
/// Build visited bitmaps for tracking left-side indices in outer joins.
We should add empty lines in between methods. I'm not sure if there's a cargo fmt option for automatically doing this.
Fixed
feat(rust/sedona-spatial-join-gpu): Add GPU-accelerated spatial join support

This commit introduces GPU-accelerated spatial join capabilities to SedonaDB, enabling significant performance improvements for large-scale spatial join operations.

Key changes:
- Add new `sedona-spatial-join-gpu` crate that provides GPU-accelerated spatial join execution using CUDA via the `sedona-libgpuspatial` library.
- Implement `GpuSpatialJoinExec` execution plan with build/probe phases that efficiently handles partitioned data by sharing build-side data across probes.
- Add GPU backend abstraction (`GpuBackend`) for geometry data transfer and spatial predicate evaluation on GPU.
- Extend the spatial join optimizer to automatically select GPU execution when available and beneficial, with configurable thresholds and fallback to CPU.
- Add configuration options in `SedonaOptions` for GPU spatial join settings including enable/disable, row thresholds, and CPU fallback behavior.
- Include comprehensive benchmarks and functional tests for GPU spatial join correctness validation against CPU reference implementations.
b93eb39 to adbfbb5
Here's the evaluation result from a variant of Sedona-SpatialBench. Each query was executed 5 times with one round of warmup to cache datasets in buff/cache, and the average times are reported. Using the GPU join is faster for queries with heavy joins, such as Q10 and Q11.

Commit: b690b12

Commit: e106ce6

@pwrliang Great work!
paleolimbot left a comment
Thank you for working on this! This is a substantial improvement on the initial work (which was also a substantial improvement!). In particular I am grateful for your work to ensure the integration with the existing sedona-spatial-join code.
I think this would work best as at least two PRs:
- Update the C++ library portion with the new approach (I left review comments on the C++ portion), since this is nicely contained
- Add a Rust wrapper around gpuspatial_c.h with a few "is it plugged in" tests
- Integrate with sedona-spatial-join
In addition to being easier for both the community and LLMs to review, @Kontinuation has rather heroically split a very large change implementing larger-than-memory spatial joins into small PRs, and splitting out the C library change and the wrapper first may give him a few extra days to merge or align conflicting pieces.
This C API is great and is much better than what we initially came up with when we started this project! Now that it has been more thought out, I left some comments to perhaps help polish it a bit. It would help to ensure every member has a docstring now that you've nicely shown that it works!
I'm not a CUDA expert and I'll have to rely on Copilot to review that bit when it's in a dedicated PR. It seems like you have great test coverage and were kind enough to ensure the licensing and TODOs were all handled.
    void* probe_indices;
};

struct GpuSpatialIndexFloat2D {
Can we call this SedonaFloatIndex2D?
 *
 * @return 0 on success, non-zero on failure
 */
int (*init)(struct GpuSpatialIndexFloat2D* self, struct GpuSpatialIndexConfig* config);
If we move the struct GpuSpatialIndexConfig parameter to GpuSpatialIndexFloat2DCreate() (which I think is possible), this structure will no longer be GPU-specific (which will make it easier to test by implementing on the CPU or implement with Apple Metal later)
};

struct GpuSpatialJoinerContext {
struct GpuSpatialIndexContext {
It would be more future-proof to keep this as last_error and private_data. (Or maybe just private_data, with the last error accessible via a callback on the GpuSpatialIndexFloat2D). It seems like the intention is this object is opaque to the user of this API except to pass to methods that support concurrent access.
void (*get_build_indices_buffer)(struct GpuSpatialIndexContext* context,
                                 void** build_indices, uint32_t* build_indices_length);
Can the build_indices be any more strongly typed here? (Or can the docstring be more specific about what exactly it contains?)
void (*get_probe_indices_buffer)(struct GpuSpatialIndexContext* context,
                                 void** probe_indices, uint32_t* probe_indices_length);
Can the probe_indices be any more strongly typed here? (Or can the docstring be more specific about what is expected?)
struct GpuSpatialRefiner {
  int (*init)(struct GpuSpatialRefiner* self, struct GpuSpatialRefinerConfig* config);
As above, can this be a SedonaSpatialRefiner with the GpuSpatialRefinerConfig moved to GpuSpatialRefinerCreate()? We can (and should!) reuse this struct for things that are not just NVIDIA GPUs.
int (*clear)(struct GpuSpatialRefiner* self);

int (*push_build)(struct GpuSpatialRefiner* self,
                  const struct ArrowSchema* build_schema,
Is it possible to pass the build_schema and the probe_schema separately (maybe as arguments to init?) and remove them from the other methods? (Totally fine if this is not possible, just might help reduce the number of arguments to the other callbacks)
const struct ArrowArray* array2,
enum GpuSpatialRelationPredicate predicate, uint32_t* indices1,
uint32_t* indices2, uint32_t indices_size, uint32_t* new_indices_size);
void (*release)(struct GpuSpatialRefiner* self);
This last_error should probably be a callback (as above)
I see some .cc files and some .cpp files. If it's easy to rename these, I usually see projects choose one and rename files to match (except for vendored files).
// specific language governing permissions and limitations
// under the License.
#include "gpuspatial/index/detail/rt_engine.hpp"
#include "gpuspatial/rt/rt_engine.hpp"
Similarly, if .h and .hpp are both used it usually means that .h files are intended for C usage. If it's easy to rename the .h files intended for C++ usage it may look slightly cleaner.